Surviving a Digging and the Futility of Statistics
Well, the after-effects of the Digging are dying down. I’m elated at the exposure; I certainly didn’t expect it to snowball to the proportions that it did. And what proportions are those, you might ask? Well, rather big ones, actually. What follows is a formidably long tale of intrigue, danger and, above all, lessons learned about the Digg effect.
The Numbers
According to Mint, slash7 (all pages included) received 16,010 page views on Monday, when I released the Scriptaculous cheat sheet.
According to Webalizer, slash7 received 58,397.
And by now, 914 Diggs. I made number five on the front page for a while. I’m tickled pink to have finally graced the front page.
When Ruby Goes Rogue
My friend Davey and I spent a fair amount of time trying desperately to shore up the server’s defenses against such an onslaught of traffic. Unfortunately, nothing really worked. Ruby and Apache conspired like fiends to dominate the CPUs, eating up cycles like Takeru Kobayashi at a hot-dog-eating contest.
In the end, I had to resort to putting up a static HTML file describing the mayhem: no images, no CSS, just static HTML and the Mint JavaScript file. We redirected the download link to a Coral cache to offload the 70K-a-pop hit on the bandwidth.
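(For the uninitiated: as I recall, Coral serves a cached copy of any URL if you append .nyud.net:8080 to its hostname. A minimal sketch of that rewrite in Ruby; the file path below is illustrative, not the real download link:)

```ruby
require 'uri'

# Rewrite a URL so it's served from the Coral distributed cache.
# Assumes Coral's hostname-suffix convention (.nyud.net:8080).
def coralize(url)
  u = URI.parse(url)
  "#{u.scheme}://#{u.host}.nyud.net:8080#{u.path}"
end

puts coralize("http://www.slash7.com/files/cheatsheet.pdf")
# => http://www.slash7.com.nyud.net:8080/files/cheatsheet.pdf
```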
By the evening, traffic had died down enough to restore Typo. It was a thrilling ride, but I’d have enjoyed it more if the rollercoaster had been equipped with seatbelts.
Why Ruby Goes Rogue
Now, I know that a Rails application can hold up to this kind of load, because we see it every day. But my Typo install didn’t. I’m not going to point the finger at Typo alone, because I suspect it was a combination of things, although Typo’s “feature completeness” may have contributed to the problem.
For one, the server’s a measly dual PIII. I don’t complain, because I get free hosting with root access, provided by an old and dear friend (thanks, Davey!). But a dual PIII is hardly the height of technology; we could get a 1U rack server for $1,200 that’d be easily more than twice as fast.
Secondly, I can’t seem to use page caching with Typo. Yes, I hear you heckling me back there. But when I turn on page caching, people can’t leave comments. I get a bizarre permission-denied error that I haven’t been able to track down. Sounds simple, right? But the file Rails allegedly can’t access is a CSS stylesheet, the permissions on which are, shall we say, quite liberal.
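For reference, here’s roughly what page caching looks like in a Rails 1.x controller. This is a minimal sketch, not Typo’s actual code; the controller, action, and model names are all illustrative:

```ruby
class ArticlesController < ApplicationController
  # Writes public/articles/show/1.html on the first request;
  # the web server serves the static file on every request after that.
  caches_page :show

  def comment
    @article = Article.find(params[:id])
    @article.comments.create(params[:comment])
    # The cached page is stale now -- expire it so the new comment shows up.
    expire_page :action => 'show', :id => @article
    redirect_to :action => 'show', :id => @article
  end
end
```

The catch is that the web server user needs write access to the cache directory under public/, which is exactly the sort of place a mysterious permission-denied error likes to hide.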
And so we come to the point I can’t avoid. Typo is a bit bloated. The bloat has made Typo run slower, and it’s also made it harder to track down bugs. We all know and love Typo, but maybe it’s time to stage an intervention.
Lessons Learned
Boy, have I learned my lesson. You might as well profit from my pain, so here’s what I’ve learned:
First, fix any errors that might cause your server to fall to its knees and beg for mercy before you submit a link to Digg.
Second, if your aspirations include regular Digging, you’d better invest in a real server instead of taking hand-outs, however tasty free lunches may be.
Third, definitely suck up to your friend who helps you survive a Digging with all sorts of arcane Apache wrangling. He deserves it.
Fourth… don’t use Apache. I’m going to see if I can’t jury-rig Lighttpd before I try any such thing again.
Fifth, it may be time to roll your (my) own. It’s not as if there isn’t proof that you can develop a reasonably functional blog in 15 minutes, or somethin’.
The Saga Continues…
The whole thing just keeps on going—the Scriptaculous cheat sheet post has gotten 23,476 page views according to Mint; Webalizer says 26,475 (a closer comparison, this time). Mint doesn’t track files, but Webalizer says the cheat sheet itself was downloaded 11,373 times—and I have no way of tracking the downloads from the various mirrors that popped up, nor the HTTP redirection I had to set up to save the server from (more) spontaneous combustion.
All in all, I did 1.9GB of traffic in one day. Yesterday, I did 2GB of traffic.
The Futility of Statistics
But let’s compare those numbers again, shall we?
|  | Mint | Webalizer |
| --- | --- | --- |
| Page Views (Monday) | 16,010 | 58,397 |
| Page Views (C.S. post) | 23,476 | 26,475 |
Not a whole lot of synchronicity going on here.
Webalizer works like a typical 1990s-era traffic analyzer: it parses log files. It was also obviously designed by nerds. It’s butt ugly and, frankly, it’s clear nobody gave a moment’s thought to good information design. If you want the wham, bam, thank you ma’am of web stats, then maybe Webalizer’s for you.
Mint, on the other hand, is much more nuanced. For one, it works via a JavaScript include file (and PHP). The JavaScript fetches all sorts of information about the user’s browser and lets you see that information in a much more useful manner: indexed by page title instead of just URL, with a limited view of which referrers led to which files. But its UI has shortcomings too, in my opinion.
But Mint’s main failing is that what makes it good also makes it bad: the JavaScript. The file must be loaded by the user’s browser for the hit to be recorded at all, so RSS feeds and file downloads are out. And in non-ideal circumstances (when your server is begging for its mommy, say), the load may be so severe that the file never loads. And of course, some people browse with JavaScript off, but that seems to be a small percentage of the population.
Look, I Can Do Basic Math!
Why the difference?
Well, there’s the load problem that I described above. There’s also the questionable nature of Webalizer’s information. Like many old-school stats packages, it has some funny accounting. Webalizer does track hits separately from page views, but what does it consider a page?
I think it might be counting CSS or JavaScript files as “pages,” seeing as it reports nearly four times as many page views (all pages) as Mint does for the Day of Reckoning. Although when you look at the specific page URL in question, the difference is only about 3,000, or 12%. This might be explained if Webalizer is counting non-HTML responses.
Another possibility is that the 12% stems from the time when the server was too overloaded to send all the little files associated with the page, like Mint’s JavaScript file. But that doesn’t explain the quadrupling of overall page views for the day.
Maybe Mint was just letting me down. I don’t know. Can’t know, really. Those of us who are statistics mongers by nature, well, we just have to learn to live with the uncertainty. It’s kind of zen, really.
As An Aside…
If you’re out there looking for a good web app idea to execute in Ruby, allow me to suggest a log file parser that doesn’t suck. You can’t get more reliable than parsing log files, but the fact is every package I’ve seen that does it (including the $800-a-head Urchin) just plain bites it. That’s why Mint ($30) is taking the world by storm, even though it’s PHP, relies on JavaScript, and certainly has failings of its own.
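To show how small the core of such a thing could be, here’s a minimal sketch of a page-view counter for a common/combined-format access log. The asset filter and the field handling are simplifying assumptions, not a finished design:

```ruby
#!/usr/bin/env ruby
# Count page views per URL in an Apache common/combined log,
# skipping the asset requests that inflate naive "page" counts.

ASSET = /\.(css|js|png|gif|jpe?g|ico|pdf)(\?|$)/i
LINE  = /\A\S+ \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3})/

pages = Hash.new(0)

ARGF.each_line do |line|
  next unless line =~ LINE
  method, path, status = $1, $2, $3.to_i
  next unless method == "GET" && status == 200  # successful page fetches only
  next if path =~ ASSET                         # ignore stylesheets, scripts, images
  pages[path] += 1
end

# Print the top twenty pages, busiest first.
pages.sort_by { |path, count| -count }.first(20).each do |path, count|
  printf "%8d  %s\n", count, path
end
```

Run it as, say, `ruby pageviews.rb access_log` and compare the totals to what Webalizer and Mint claim for the same day.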
I would have assumed Webalizer was counting GETs to pages for page views. But really, even just a simple wc -l on your logs for your index page (if you have one) would tell you who is closer to the truth. To wit:
grep '"GET /index' access_log | wc -l
See what that number is… and then see how many page views Webalizer shows for your index page vs. Mint. Are they close to what’s actually in the log file?
Great post Amy! It’s cool to hear all of the behind-the-scenes panic that goes on during an intense round of digging.
As for your Mint issues, you might want to try the Download Counter Pepper. It seems to track downloads of files quite nicely.
<a href="http://massiveblue.net/pepperminttea/2005/11/29/download-counter-pepper-v12/">Download Counter</a>
Hi Amy, I got the pdf cheatsheet through a direct link you sent out in the RSS feed titled "Scriptaculous Cheat Sheet #1". Anyone who grabbed the pdf through the direct link would show up in Webalizer and not Mint. That alone could be the difference you are seeing.
I also second the suggestion to Pepper your Mint.
BTW, I quite enjoyed your talk at CoR, as well as your ObjectView article, cheat sheets, blog, etc.
Thanks for everything!
Kind Regards, Josh
I agree, Typo is a beast. I have issues with comments all the time, and having to run it with 1.0 Rails is fun.
I would be interested to see what AWStats has to say about your logs. I have found that its stats line up with my Google AdWords information pretty well when it comes to impressions. I would be happy to process the log to see how far the rabbit hole goes, or if you want to roll your own, check out my tutorial on <a href="http://www.benr75.com/pages/lighttpd_awstats_tutorial">setting up Lighttpd and AWStats</a>. Lighttpd would definitely have helped against the Digg onslaught!
L8r ~
Josh,
Yep, I’m not a total dweeb 😉 I know that Webalizer catches file downloads. But does it count them as "pages"? If so, that could definitely be part of it; it reports 12,000 or so downloads total. Unfortunately, it doesn’t let you see which of those 12,000 are from which days. I don’t believe that number includes the HTTP redirects to the mirrored file, though… Not really interested in grepping myself (sorry, jetrac) 🙂
Ben,
AWStats looks a lot better than it did last time I looked at it. Is it still really hard to install? And sadly, I am still on Apache, and who knows when I’ll have time to get Lighttpd up (sigh!)
Amy,
I didn’t intend to imply dweebiness of any sort 🙂 Also, in retrospect, I would have started my comment with a well-deserved CONGRATULATIONS! It is a good day when your popularity pegs your server’s resources.
May your success continue to increase until you can peg the resources on even the mightiest servers!
All the best,
Josh
Damn, and I was excited when my traffic peaked to 170 visitors due to Digg!
Interesting take on Typo. I haven’t even looked at it, since I’m still on WordPress and <a href="http://run4yourlives.com/archives/2006/04/27/getting-started-with-ruby-on-rails-properly/">writing up my own blog</a> as a way to learn Rails. I suppose you’d suggest I keep it nice and small then! 🙂
I also recommend the Dloads Pepper for tracking downloads. It seems to work great.
http://damonparker.org/blog/pepper-dloads/
Internally, we use Summary (summary.net). It’s not as pretty as Mint, but it works very well.
We are planning on adding VisitorVille as a more user-friendly option for looking at stats. http://www.visitorville.com/
I use Typo on a high traffic site and have had no issues thanks to page caching.
Fixing page caching should have been priority #1 in your book IMO.
We have tons of sites running Apache. Some push 1GB a minute (not 1.9GB a day). Millions of page views a day on a single server running Apache.
It’s silly to claim that something that works so well for others doesn’t work well. Clearly something is wrong with the Apache configuration, or something else is the bottleneck; I’ll wager it’s the latter.
I live dangerously with Typo-svn, and I find myself muttering about bloat as well (mostly concerning the gobs of sidebar "plugins" that actually aren’t, and should frankly be actual, external downloads). </rant>
There has to be a better way.
Yeah, I’ve run Apache and PHP servers serving fully dynamic content at multi-Gb per minute with no content caching since before PIIIs were on the market and memory still cost a lot.
It ain’t Apache that’s bringing ya down.
Buddies: I think it’s pretty obvious that I’m talking about Apache in the context of FastCGI. Lighttpd gets significantly better performance out of FastCGI Ruby code than Apache does.
Got some benchmarks for that?
You could avoid the Ruby performance problems by linking directly to the PDF next time you submit to Digg.
IIRC, Mint does not log bot access (due to the JavaScript), whereas Webalizer logs every request to every file on your system.
That all said, it is pretty obvious that the numbers are skewed. Again, IIRC, every time someone points a link to your site/sheet and their system uses some ping-o-matic service, a whole bunch of bots will visit your website time and time again.
Nice work with the sheet.
And +1 for the direct PDF link from Digg.
AWStats does indeed rock. Unfortunately, its evolution has slowed. There’s also a lesser-known plugin called AWTotals. It gives you a yearly view of your data rather than AWStats’ monthly one. I’ve heard rumblings of a new stats collector in development by some Austin hackers, and I’m really interested to see what their solution might be.
Why don’t you try Analog instead of Webalizer? I’ve used it for a long time, I’ve tested it against monsters like WebTrends and Google Analytics (which, by the way, tracks pages with JavaScript, like Mint), and the numbers are very reliable.
Good idea!
That is strange.
Sorry for the bump. Somehow I never saw this when it was originally posted back in April (I would have been in Iceland at the time).
If you are correct that Webalizer is tracking CSS and JavaScript files as page requests, then those numbers make sense, because Webalizer is likely counting the Mint JavaScript include too.
If you have 1 CSS file and 1 JavaScript file, plus the Mint JavaScript file, requested with every page, that would appear as 4 hits in Webalizer and only 1 (correctly) in Mint.
Actually, even if it wasn’t tracking CSS and JavaScript, I don’t think Webalizer checks the MIME type of the files served. Mint, a PHP file, is accessed twice per visit (once to generate the JavaScript to collect visit data and again when that data is recorded). To Webalizer that might appear as two additional hits.
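A quick back-of-the-envelope check of that theory against the numbers from the post; the four-requests-per-page multiplier is the assumption from the comment above, not a measured figure:

```ruby
mint_page_views   = 16_010  # Mint's count for Monday, from the post
requests_per_page = 4       # assumed: the page itself + CSS + JS + Mint include

estimated_hits = mint_page_views * requests_per_page
puts estimated_hits  # => 64040, in the same ballpark as Webalizer's 58,397
```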